library(readr)
library(ggplot2)
library(ggmosaic)
library(dplyr)
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv")
Simple logistic regression (Notes)
STAT 155
Notes
Learning goals
By the end of this lesson, you should be able to:
- Explain the differences between linear regression and logistic regression for modeling binary outcomes
- Construct simple logistic regression models in R
- Interpret coefficients in simple logistic regression models
- Use simple logistic regression models to make predictions
- Describe the form (shape) of relationships on the log odds, odds, and probability scales
Readings and videos
Choose either the reading or the videos to go through before class.
- Reading: Sections 4.1-4.3 in the STAT 155 Notes
- Video: Logistic regression (slides)
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Context: The Titanic was a British passenger ship that famously sank in 1912 after hitting an iceberg in the North Atlantic Ocean. Approximately 2200 passengers were on board the Titanic, and it’s estimated that 1500 of them did not survive the sinking. Historians have worked diligently to collect data on the passengers who were aboard the Titanic.
We have data for 1313 passengers, where the following information is available for each passenger:
- Name: name
- PClass: ticket class (1st, 2nd, 3rd)
- Age: age (years)
- Sex: binary sex (female, male)
- Survived: indicator that the passenger survived (1 = survived, 0 = died)
Our question of interest is: how do different factors relate to survival?
In the Console, run install.packages("ggmosaic") to install the ggmosaic package that we’ll be using to make a specialized type of plot.
Exercise 1: Exploring age
Did younger passengers tend to have higher survival rates than older passengers?
Visualizing the relationship between a binary response and a quantitative predictor can be tricky. We will take a few approaches here.
- Create a boxplot where one box corresponds to the age distribution of survivors and the second to that of non-survivors.
- Create density plots with separate colors for the survivors and non-survivors.
- The remainder of the code below creates a plot of the fraction who survived at each age. (Since we have a large data set with multiple, though sometimes not many, observations at most ages, we can manually calculate the survival fraction at each age.)
After inspecting the plots, summarize what you learn.
# Create a boxplot
# Note that you'll need to force R to view Survived as a binary categorical variable by using x = factor(Survived) instead of just x = Survived in the aes() part of your plot
# Create a density plot (you'll need to use factor(Survived) again)
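As a reference point, here is one possible way to set up these two plots (a sketch only; your aesthetic choices may differ):

```r
# Boxplot: compare the age distribution of survivors (1) and non-survivors (0)
p_box <- ggplot(titanic, aes(x = factor(Survived), y = Age)) +
  geom_boxplot()

# Density plot: overlay the two age distributions, with transparency so both are visible
p_dens <- ggplot(titanic, aes(x = Age, fill = factor(Survived))) +
  geom_density(alpha = 0.5)

p_box
p_dens
```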
# Use the code below to create a plot of the fraction who survived at each age
titanic_summ <- titanic %>%
group_by(Age) %>%
summarize(frac_survived = mean(Survived))
ggplot(titanic_summ, aes(x = Age, y = frac_survived)) +
geom_point() +
geom_smooth(se = FALSE)
Exercise 2: Exploring sex and ticket class
Were males or females more likely to survive? Did 1st class passengers tend to survive more than 2nd and 3rd class passengers?
The code below creates plots that allow us to explore how Sex and PClass relate to survival. The first two plots are standard bar plots that use color to indicate what fraction of each group survived. The last two plots are mosaic plots that are much like the standard bar plots, but the width of the bars reflects the distribution of the x-axis variable. (The widest bar is the most prevalent category.)
Summarize what you learn about the relationship between sex, ticket class, and survival.
# Standard bar plots
ggplot(titanic, aes(x = Sex, fill = factor(Survived))) +
geom_bar(position = "fill")
ggplot(titanic, aes(x = PClass, fill = factor(Survived))) +
geom_bar(position = "fill")
# Mosaic plots
ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
geom_mosaic(aes(x = product(Sex), fill = Survived))
ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
geom_mosaic(aes(x = product(PClass), fill = Survived))
Exercise 3: Linear regression model
For now we will focus on exploring the relationship between (ticket) class and survival.
Let’s tabulate survival across classes. We can tabulate across two variables by providing both variables to count():
titanic %>%
count(PClass, Survived)
## # A tibble: 7 × 3
## PClass Survived n
## <chr> <dbl> <int>
## 1 1st 0 129
## 2 1st 1 193
## 3 2nd 0 160
## 4 2nd 1 119
## 5 3rd 0 573
## 6 3rd 1 138
## 7 <NA> 0 1
- Use the count() output to fill in the following contingency table:
| Class | Died | Survived | Total |
|---|---|---|---|
| 1st Class | ___ | ___ | ___ |
| 2nd Class | ___ | ___ | ___ |
| 3rd Class | ___ | ___ | ___ |
| Total | ___ | ___ | ___ |
- Using your table, estimate the following:
- the probability of surviving among 1st class passengers
- the probability of surviving among 2nd class passengers
- the probability of surviving among 3rd class passengers
- the difference in the probability of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how much lower is the probability of 2nd class passengers as compared to 1st class passengers?)
- the difference in the probability of surviving, comparing 3rd class passengers to 1st class passengers (i.e., how much lower is the probability of 3rd class passengers as compared to 1st class passengers?)
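If you'd like to check your by-hand estimates, the probabilities and differences can be computed directly from the counts in the count() output:

```r
# Counts taken from the count() output above
surv_1st <- 193; died_1st <- 129
surv_2nd <- 119; died_2nd <- 160
surv_3rd <- 138; died_3rd <- 573

# Probability of surviving within each class: survivors / class total
p1 <- surv_1st / (surv_1st + died_1st)
p2 <- surv_2nd / (surv_2nd + died_2nd)
p3 <- surv_3rd / (surv_3rd + died_3rd)
c(p1 = p1, p2 = p2, p3 = p3)

# Differences in survival probability relative to 1st class
p2 - p1
p3 - p1
```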
- After fitting the linear regression model below, write out the model formula using correct notation. Explain carefully what it means to talk about the expected/average value of a binary variable.
lin_mod <- lm(Survived ~ PClass, data = titanic)
summary(lin_mod)
##
## Call:
## lm(formula = Survived ~ PClass, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5994 -0.1941 -0.1941 0.4006 0.8059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59938 0.02468 24.284 < 2e-16 ***
## PClass2nd -0.17286 0.03623 -4.772 2.03e-06 ***
## PClass3rd -0.40529 0.02975 -13.623 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4429 on 1309 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1315, Adjusted R-squared: 0.1302
## F-statistic: 99.09 on 2 and 1309 DF, p-value: < 2.2e-16
- Write an interpretation of each of the coefficients in your linear regression model. How do your coefficient estimates compare to your answers in part b?
Exercise 4: Logistic regression model (categorical predictor)
- Refer back to your contingency table from Exercise 3a. Using your table, estimate the following:
- the odds of surviving among 1st class passengers
- the odds of surviving among 2nd class passengers
- the odds of surviving among 3rd class passengers
- the ratio of the odds of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how many times higher/lower is the odds of survival among 2nd class passengers as compared to 1st class passengers?)
- the ratio of the odds of surviving, comparing 3rd class passengers to 1st class passengers
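To check your by-hand work, the odds and odds ratios can also be computed directly from the counts in the count() output:

```r
# Odds of surviving within each class: survivors / deaths
odds_1st <- 193 / 129
odds_2nd <- 119 / 160
odds_3rd <- 138 / 573
c(odds_1st = odds_1st, odds_2nd = odds_2nd, odds_3rd = odds_3rd)

# Odds ratios relative to 1st class
odds_2nd / odds_1st
odds_3rd / odds_1st
```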
- After fitting the logistic regression model below, write out the model formula using correct notation.
log_mod <- glm(Survived ~ PClass, data = titanic, family = "binomial")
coef(summary(log_mod))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4028778 0.1137246 3.542574 3.962427e-04
## PClass2nd -0.6989281 0.1660923 -4.208071 2.575600e-05
## PClass3rd -1.8265098 0.1480705 -12.335410 5.839072e-35- Write an interpretation of each of the exponentiated coefficients in your logistic regression model. Think carefully about what we are modeling when we fit a logistic regression model. How do these exponentiated coefficient estimates compare to your answers in part a?
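One way to obtain the exponentiated coefficients (assuming the log_mod object fit above) is:

```r
# Exponentiating moves estimates from the log odds scale to the odds scale:
# the intercept becomes an odds, and the other coefficients become odds ratios
exp(0.4028778)   # exp(intercept): odds of surviving among 1st class passengers
exp(-0.6989281)  # exp(PClass2nd): odds ratio comparing 2nd class to 1st class
exp(-1.8265098)  # exp(PClass3rd): odds ratio comparing 3rd class to 1st class
```

With the fitted model in hand, `exp(coef(log_mod))` gives all three values at once.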
Exercise 5: Logistic regression model (quantitative predictor)
Now we will explore how to interpret a quantitative predictor in a logistic regression model.
- After fitting the logistic regression model below, write out the model formula using correct notation.
log_mod <- glm(Survived ~ Age, data = titanic, family = "binomial")
coef(summary(log_mod))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08142783 0.17386170 -0.4683483 0.63953556
## Age -0.00879462 0.00523158 -1.6810637 0.09275054
- Write an interpretation of each of the exponentiated coefficients in this logistic regression model.
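For a quantitative predictor, the exponentiated slope gives the estimated multiplicative change in the odds of survival for each 1-year increase in age. A quick check of the numbers from the output above:

```r
exp(-0.08142783)  # exp(intercept): estimated odds of survival at age 0
exp(-0.00879462)  # exp(Age): multiplicative change in odds per additional year of age
```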
Exercise 6: Linear vs. logistic modeling
To highlight a key difference between linear vs. logistic modeling, consider the following linear and logistic regression models of survival with sex and age as predictors in addition to ticket class.
lin_mod2 <- lm(Survived ~ PClass + Sex + Age, data = titanic)
coef(summary(lin_mod2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.130522829 0.051940872 21.765573 8.158449e-82
## PClass2nd -0.207433817 0.039239825 -5.286308 1.637737e-07
## PClass3rd -0.393344488 0.037709874 -10.430809 7.001373e-24
## Sexmale -0.501325667 0.029419802 -17.040416 2.697807e-55
## Age -0.006004789 0.001105949 -5.429536 7.633977e-08
log_mod2 <- glm(Survived ~ PClass + Sex + Age, data = titanic, family = "binomial")
coef(summary(log_mod2))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.75966210 0.397567324 9.456668 3.179129e-21
## PClass2nd -1.29196240 0.260075781 -4.967638 6.777324e-07
## PClass3rd -2.52141915 0.276656805 -9.113888 7.948131e-20
## Sexmale -2.63135683 0.201505379 -13.058494 5.684093e-39
## Age -0.03917681 0.007616218 -5.143868 2.691392e-07
- Use the linear regression model to predict the probability of survival for Rose (a 17-year-old female in 1st class) and Jack (a 20-year-old male in 3rd class). Show your work.
- Now use the logistic regression model to predict the survival probability for Rose and Jack. Show your work. (Hint: use the logistic regression model to obtain the predicted log odds, exponentiate to get the odds, and then convert to probability.)
- Comment on differences that you notice in the predictions from parts a and b.
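After working through the predictions by hand, you can check your answers by carrying out the hint's steps in R, plugging in the coefficients from log_mod2 above:

```r
# Coefficients from the log_mod2 output above
b0 <- 3.75966210; b_3rd <- -2.52141915
b_male <- -2.63135683; b_age <- -0.03917681

# Rose: 17-year-old female in 1st class (1st class and female are the reference levels)
log_odds_rose <- b0 + b_age * 17
odds_rose <- exp(log_odds_rose)          # log odds -> odds
prob_rose <- odds_rose / (1 + odds_rose)  # odds -> probability

# Jack: 20-year-old male in 3rd class
log_odds_jack <- b0 + b_3rd + b_male + b_age * 20
prob_jack <- exp(log_odds_jack) / (1 + exp(log_odds_jack))

c(Rose = prob_rose, Jack = prob_jack)
```

Note that, unlike the linear model, the logistic model's predictions are guaranteed to fall between 0 and 1.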
Reflection
What binary outcomes might be relevant in your project? What predictor(s) could be relevant in a logistic regression model for that outcome?
Response: Put your response here.
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions on the course website, at the bottom of the corresponding chapter.
- Work on homework!